#### 1.1. A VisualSim-based system modeling case



#### 1.1.1 **Transaction source**

The model has only one transaction source, the ADC Source module, which models the sampling process of the system's front-end ADC.

The source model also determines the CPI period according to the system working mode; each CPI contains a number of PRIs. The source emits its transactions in sequence to represent the system's signal paths and their I and Q channels.

#### 1.1.2 **Transaction data structure**

In VisualSim, the user can define transactions of any format and assign them to a data structure (DS). Table 2 shows part of the transaction data structure designed for this model. For the complete definition, see the attachment "C2 DS.txt".

Table 2

| Data Item   | Data Type | Default Value | Remark                   |
|-------------|-----------|---------------|--------------------------|
| ID_CPI      | int       | 1             | /* CPI No */             |
| ID_PRI      | int       | 1             | /* PRI No */             |
| PRI_Length  | int       |               |                          |
| PRI_Samples | int       | 1900          |                          |
| Channel     | string    | Sum           | /* Sum or Difference */  |
| I/Q         | string    | I             | /* I or Q channel */     |
| S/M         | string    | ST            | /* ST or MT */           |
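As a rough illustration, the fields of Table 2 can be mirrored in a small Python dataclass. This is only a sketch and not VisualSim's DS syntax: the class name `Transaction` and the field names `I_Q` and `S_M` (standing in for `I/Q` and `S/M`, which are not valid Python identifiers) are assumptions, and `PRI_Length` is left as `None` because Table 2 gives no default for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transaction:
    """Partial transaction data structure mirroring Table 2 (illustrative only)."""
    ID_CPI: int = 1                    # CPI number
    ID_PRI: int = 1                    # PRI number
    PRI_Length: Optional[int] = None   # Table 2 gives no default value
    PRI_Samples: int = 1900
    Channel: str = "Sum"               # "Sum" or "Difference"
    I_Q: str = "I"                     # "I/Q" in Table 2: "I" or "Q"
    S_M: str = "ST"                    # "S/M" in Table 2: "ST" or "MT"


# One transaction for the Q channel of the difference path, PRI 3 of CPI 1.
t = Transaction(ID_PRI=3, Channel="Difference", I_Q="Q")
print(t)
```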

#### 1.1.3 **Model parameters**

The top-level model parameters define the system working mode (Mode), the DSP core clock frequency (TS201\_CCLK\_MHz), and the thresholds of the two data buffers. Further parameters are defined inside individual modules and are described below where needed.

#### 1.1.4 **System behavior flow modeling**

The behavior flow of the model follows the given system workflow diagram exactly. Behavior modeling is not a static functional block diagram, however: the behavioral model of each functional module embodies specific design decisions about its internal control flow, data flow, trigger mode, and data-flow coupling. The modules of the behavior flow model are described below.

#### Module Buffers1:

Models the data buffer on the high-speed ADC acquisition board. The parameter Buffer\_Size\_in\_PRI sets the buffer capacity, in units of the number of sampling points in one PRI. Once the module has received a predetermined amount of data (set by the parameter Threshold\_in\_PRI), it requests the system transmission resource (currently defined as a 64-bit/66 MHz PCI bus) and transfers the data to the system FPGA board.

#### Module Buffers2:

Models the data buffering and forwarding function on the FPGA board, which corresponds to Function 1 in the document. Like Buffers1, once it has received a predetermined amount of data (set by the parameter Threshold\_in\_PRI), it requests the specified LINK port and sends the data over that LINK to the designated DSP.
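The threshold-triggered behavior shared by Buffers1 and Buffers2 can be sketched as follows. This is a simplified illustration, not the VisualSim block: the class `ThresholdBuffer` and its method names are hypothetical, and the transfer is treated as instantaneous, whereas in the model the batch must wait for the PCI bus or LINK port to be granted.

```python
class ThresholdBuffer:
    """Buffer that releases its contents as one batch once a threshold is reached."""

    def __init__(self, size_in_pri: int, threshold_in_pri: int):
        if threshold_in_pri > size_in_pri:
            raise ValueError("threshold cannot exceed buffer size")
        self.size_in_pri = size_in_pri
        self.threshold_in_pri = threshold_in_pri
        self.stored = []

    def push(self, pri_id: int):
        """Store one PRI; return the batch to transfer, or None if below threshold."""
        if len(self.stored) >= self.size_in_pri:
            raise OverflowError("buffer overflow: data would be lost")
        self.stored.append(pri_id)
        if len(self.stored) >= self.threshold_in_pri:
            batch, self.stored = self.stored, []
            return batch
        return None


buf = ThresholdBuffer(size_in_pri=4, threshold_in_pri=2)
batches = [buf.push(i) for i in range(1, 6)]
print(batches)  # [None, [1, 2], None, [3, 4], None]
```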

Figure 2 is a block diagram showing the internal structure of the module Buffers2.



#### Module F2:

Corresponds to Function 2 in the document: digital pulse compression performed on the designated DSP, a purely compute-bound task. The module parameters Num\_Fixed2Float\_a\_Cycle, Complex\_Multiplication\_Cycles, IFFT\_factor, and others set the actual computational cost of the function.

#### Module F3:

Transposes the results of the digital pulse compression and sends them to the next functional block.

The matrix is transposed so that subsequent operations access DRAM sequentially, which improves access performance. Because the on-chip memory of the TS201 DSP is DRAM, storing the matrix during transposition crosses a page boundary on every data access. The prefetch cache then fails, and such cross-page access is 6 to 8 times slower than sequential access. The module parameter page\_access\_cycle is therefore used to model cross-page accesses.
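The effect of the cross-page penalty can be illustrated with a simple cost estimate. The sketch below is an assumption about how such a parameter might enter the cost, not the model's actual formula: the function `access_cycles`, the one-cycle sequential cost, and the value 8 chosen for `page_access_cycle` (the upper end of the 6 to 8 range) are all illustrative.

```python
def access_cycles(n_sequential: int, n_cross_page: int,
                  page_access_cycle: int, sequential_cycle: int = 1) -> int:
    """Total memory-access cycles for a mix of sequential and cross-page accesses."""
    return n_sequential * sequential_cycle + n_cross_page * page_access_cycle


rows, cols = 64, 800   # 64 PRIs of 800 points each
n = rows * cols

# Sequential pass: every access stays within the current page.
sequential_pass = access_cycles(n, 0, page_access_cycle=8)
# Transposed store: every access is assumed to cross a page boundary.
transposed_store = access_cycles(0, n, page_access_cycle=8)

print(transposed_store / sequential_pass)  # 8.0
```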

During modeling we found that the part of Function 3 most worth studying is not the transpose operation itself but the data coupling between Function 3 and Function 4. We therefore designed several possible schemes for the data transfer in F3 and found that the choice of scheme has a large impact on the system. The present model mainly contains two designs: ① centralized transmission, in which the results in DRAM are transferred out together after all PRIs in a CPI have been transposed; ② distributed transmission, in which the data is sent to Function 4 each time Function 2 has processed a few PRIs (rows), and the transposition is actually carried out in Function 4. The parameter Case of module F3 switches between the implementations, and the parameter Trans\_Threshold gives the number of PRI rows sent in each transmission under the distributed scheme.

#### Module F4:

Performs the windowed FFT on part of the matrix. Module F4 selects the region automatically according to the system working mode.

In deployment 1, note first that module F4 exhibits CPI pipeline processing (serial CPI processing). Because the windowed FFT in F4 can begin only after 64 PRIs of data have been received, F4 performs two tasks in each CPI frame period: ① for the current CPI frame, it transposes the rows of the PRIs as they arrive; ② for the previous CPI frame, whose 64 transposed PRIs have already been accumulated, it performs the windowed FFT directly.

Because both tasks run in the same CPI period in F4 (more precisely, on the same processor), a key question for the model is whether the two tasks overlap in time on the DSP, that is, whether task contention occurs. For this reason, several task deployment schemes were designed for these two F4 tasks, controlled by the parameters Case, Num\_of\_Series, Num\_of\_Series2, and others.

Module F4 also has an important modeling property: it changes the transaction rate significantly. F4 receives 64 transactions per CPI period (one per PRI) and emits 800 transactions per CPI period (8 in system working mode 2).
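The change in transaction rate follows from the corner turn: 64 input rows of 800 points become 800 output columns of 64 points. The sketch below shows only this reshaping, under the assumption that each input transaction carries one PRI of 800 points; the function `corner_turn` is hypothetical, the data values are placeholders, and no FFT is computed.

```python
def corner_turn(rows):
    """Turn N input rows of M points into M output columns of N points."""
    return [list(col) for col in zip(*rows)]


N_PRI, N_POINTS = 64, 800

# Each input transaction carries one PRI; values are (pri, point) placeholders.
inputs = [[(pri, k) for k in range(N_POINTS)] for pri in range(N_PRI)]
outputs = corner_turn(inputs)

print(len(inputs), len(outputs), len(outputs[0]))  # 64 800 64
```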

Figure 3 is a block diagram showing the internal structure of the module F4.



#### Module F5:

Corresponds to Function 5 in the document and performs the IDFT on part of the matrix data. Functions 4 and 5 both receive their data from Function 3. In the behavioral model, modules F4 and F5 automatically select the PRI transactions they should receive according to the system working mode and the F3 output. Module F5 behaves much like module F4. After F5, the number of output transactions is 800, 400, or 8 per CPI period, depending on the system working mode.

#### Module F6:

Corresponds to Function 6 in the document: receives transaction data from modules F4 and F5 and computes the complex modulus.

Under the given deployment, Functions 5 and 6 share the same DSP. Part of the data processed by Function 6 is therefore local (produced on the same DSP), while the rest arrives from outside (from the DSP that performs Function 4). The timing relationship between the two data streams, and the different ways of deploying the processing tasks, are modeled in module F6.

#### Module F7:

Function 7 can begin processing only after it has received the data of a complete CPI frame period, so an event queue of depth 800 is placed in front of module F7 to model the input buffer of Function 7.

#### Module F8:

Separates static and moving targets and distributes the ST and MT data to the corresponding processing functions. In module F8, a random number in the range 12 to 16 determines the ST data size, which makes the model behavior more consistent with the actual system.

#### Module F9-F12:

These modules correspond to Functions 9 to 12 in the document. The functions are relatively simple, as are their deployment schemes and the data coupling they require. The points above summarize the system behavior modeling of this case. The behavior described in this case is fairly complex, so it gives a reasonably comprehensive picture of behavioral flow modeling in transaction-level system modeling.

#### System resource modeling

System resources are modeled in full accordance with the given hardware architecture diagram. The main issue in resource modeling is choosing, according to the questions the model must answer, a reasonable level of abstraction for the system hardware resources.

At the top level of the model, the system resources comprise the PCI bus and four processing modules, M1 to M4 in the actual system.

The PCI bus resource model defines the parameters PCI\_Bus\_MHz, PCI\_Bus\_Width\_in\_Bytes, PCI\_Burst\_Length, PCI\_Arbitration\_Delay\_Factor, and others to reflect the data transfer performance of the system's PCI bus.
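As a back-of-the-envelope check on these parameters, the peak bandwidth of the bus is the clock rate times the bus width. The sketch below is an idealized estimate only: the function names are hypothetical, and burst length and arbitration delay (which the model does account for through PCI\_Burst\_Length and PCI\_Arbitration\_Delay\_Factor) are deliberately ignored.

```python
def pci_peak_bandwidth_mb_s(bus_mhz: float, width_bytes: int) -> float:
    """Peak bandwidth in MB/s (10^6 bytes/s): one bus-width word per clock cycle."""
    return bus_mhz * width_bytes


def pci_transfer_time_us(n_bytes: int, bus_mhz: float, width_bytes: int) -> float:
    """Ideal transfer time in microseconds, ignoring burst and arbitration effects."""
    return n_bytes / (bus_mhz * width_bytes)


# 64-bit / 66 MHz PCI: 8 bytes per cycle at 66 MHz.
print(pci_peak_bandwidth_mb_s(66, 8))      # 528
print(pci_transfer_time_us(5280, 66, 8))   # 10.0
```

Real transfers take longer than this ideal figure, which is why the model exposes the burst and arbitration parameters separately.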

The resource model inside each processing module includes four TS201 processors, the Cluster Bus shared by each pair of processors, and the PCI bridge between each Cluster Bus and the PCI bus. Figure 4 shows the resource model inside each processing module.



The TS201 DSP resource model is a reasonable abstraction based on a close reading of the TS201 product documentation. Figure 5 shows the structure of the TS201 DSP resource model. It is a hierarchical module whose submodules contain more detailed modeling information, which is not described further here.

#### **Architecture mapping dispatcher (Dispatcher)**

Once behavior modeling and resource modeling are complete, a very important step is to establish an explicit relationship between system behavior and system resources; this is the definition of the system architecture.

VisualSim provides several flexible ways to express a system architecture, such as virtual connections (Virtual Connection) and dynamic mapping (Mapping), which connect transaction flows between components without actual physical connections.

To make the architecture definition more flexible, this model defines a separate architecture-mapping module, the Dispatcher. The Dispatcher uses the Database function to read the resource mapping of each task from an external csv file and uses it to connect the behavior model to the resource model.

Note that the Dispatcher is a "pseudo" system device: no such device exists in the actual system. The module must therefore handle dispatch requests issued concurrently by several behavior modules without adding any processing delay to the transaction flow.

With the Dispatcher in the model, the system architecture configuration can be changed simply by editing the csv file. The two deployments of this case are both defined by csv configuration files. For details, see the attached csv files.
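The idea of driving the architecture mapping from a csv file can be sketched in a few lines. The column names (`task`, `module`, `processor`), the rows, and the functions `load_mapping` and `dispatch` below are all hypothetical; the layout of the actual csv files is not shown in this document, and VisualSim's Database function is not used here.

```python
import csv
import io

# Hypothetical mapping content; the real csv layout is not reproduced here.
MAPPING_CSV = """task,module,processor
F2,M1,DSP1
F3,M1,DSP3
F4,M2,DSP3
"""


def load_mapping(text: str) -> dict:
    """Return {task: (module, processor)} parsed from csv text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["task"]: (row["module"], row["processor"]) for row in reader}


def dispatch(task: str, mapping: dict) -> tuple:
    """Look up the resource assigned to a task; a pure lookup with no added delay."""
    return mapping[task]


mapping = load_mapping(MAPPING_CSV)
print(dispatch("F4", mapping))  # ('M2', 'DSP3')
```

Switching to another deployment then amounts to loading a different csv text, with no change to the behavior or resource code.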



#### 1.1.7 **Model outputs**

This case does not pose very explicit model questions (Model Questions), so we defined a set of typical outputs for the model ourselves.

#### System operating-condition charts

The operating-condition charts currently defined in the model cover the four DSPs and two Cluster Buses in each module (M1 to M4), as well as the overall activity of the DSP LinkPorts. For further explanation of these charts, see § 3.3.

Because conflicts in access to the PCI bus were found during modeling, an operating-condition chart of the PCI bus is also provided for study and analysis. Figure 6 shows the PCI bus operating conditions under deployment 1.



#### Transaction record inspection

VisualSim can monitor the transactions that pass through any node of the system model.

For example, module F4 changes the transaction rate significantly between input and output: 64 transactions enter F4 in each CPI period, and 800 transactions leave it. Opening monitor windows on the input and output of F4 gives a record of every transaction at each node, including the transaction ID and the time at which the transaction passed the node.

Time is one of the most important intrinsic properties of a transaction in a transaction-level simulation model. With it, the end-to-end latency between any two nodes of the system model is easy to compute, however complex the flows and operations between the start node and the end node.
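Given timestamped records at two nodes, the end-to-end latency is a per-ID subtraction. The sketch below assumes each record can be reduced to a mapping from transaction ID to simulation time; the function name and the timestamps are hypothetical and not taken from the model.

```python
def end_to_end_latency(records_start: dict, records_end: dict) -> dict:
    """Latency per transaction ID between two monitored nodes.

    Each argument maps a transaction ID to the simulation time (in ms)
    at which that transaction passed the node.
    """
    return {tid: records_end[tid] - records_start[tid]
            for tid in records_start if tid in records_end}


# Hypothetical timestamps in ms.
at_source = {1: 0.0, 2: 0.5, 3: 1.0}
at_sink = {1: 2.0, 2: 2.75, 3: 3.5}

print(end_to_end_latency(at_source, at_sink))  # {1: 2.0, 2: 2.25, 3: 2.5}
```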

#### System processing latency

Based on the transaction timestamps, the system model reports at module F12 the time at which all processing of one CPI of data is complete. This is the full system processing delay: the time from the first ADC sampling point of a CPI frame to the completion of the last processing task for that frame.

The full system processing delay is affected by every element of the system design: a change in processor or bus speed, a change in the software deployment scheme, a change in a data buffer threshold, and so on. With the VisualSim system model, the effect of each design change on the system-wide processing delay can be assessed quickly.





#### **Resource Utilization Report**

The model currently reports the average utilization of each DSP processor in each module over the simulation period.

Although this model does not use VisualSim's dedicated power modeling toolkit, processor utilization can still give a power estimate for the system design: the sum of the utilization of all DSPs serves as an indicator of system power consumption, and comparing this value across design alternatives shows which designs consume less power.
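The utilization-based indicator described here is simply a sum over all DSPs. A minimal sketch, with a hypothetical function name and made-up utilization figures for two design alternatives:

```python
def power_indicator(utilization: dict) -> float:
    """Sum of per-DSP average utilization (each value in the range 0..1)."""
    return sum(utilization.values())


# Hypothetical utilization figures, not results from the model.
scheme_a = {"DSP1": 0.5, "DSP2": 0.25, "DSP3": 0.75, "DSP4": 0.5}
scheme_b = {"DSP1": 0.5, "DSP2": 0.25, "DSP3": 0.5, "DSP4": 0.25}

print(power_indicator(scheme_a), power_indicator(scheme_b))  # 2.0 1.5
```

The indicator is only relative: it ranks alternatives against each other and does not yield a power figure in watts.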

The system model also reports the usage of the data buffers on the high-speed ADC acquisition board and the FPGA board. These reports help determine an appropriate buffer capacity, one that avoids both wasted memory (especially in the FPGA design) and buffer overflow.

#### Deployment 2 of this case

For comparison with deployment 1, which uses CPI pipeline processing (serial CPI processing), we also built a behavioral model of a CPI parallel-processing scheme. In this behavioral model, modules F3 to F6 were adjusted and a new architecture-mapping csv configuration file (C2\_Flow\_02.csv) was defined.

#### Model execution results for this case: ESV

The operating-condition charts output by this model show clearly which task each DSP, each Cluster Bus, each LinkPort, and the PCI bus is executing at each moment, and they reveal the timing relationships of the running system software. Charts that display the working state of the system this clearly are a direct manifestation of the ESV.

The charts are interpreted below where necessary, using as an example the operating-condition charts for deployment 1 (serial CPI processing), sum-channel processing, and working mode 1.

#### Operating-condition chart of module M1

The timing relationships of M1 are relatively easy to understand. Note first that in working mode 1 the system does not process PRIs 1, 2, 67, and 68 of each CPI.

Figure 8 shows some noteworthy details. On each DSP of M1, data arrives from the FPGA over LINK, and the DSP then starts pulse compression. When the F3 transmission threshold parameter Trans\_Threshold = 1, DSP3 finishes pulse compression of the first valid PRI (PRI #3) and then starts sending data from the DSP to M2 through the EP-Cluster Bus and the PCI bridge. Two consecutive transmissions appear on the Cluster Bus after each PRI is processed because, under deployment 1, the 800-point pulse compression result is split into two 400-point halves that are sent to two DSPs in M2. The gap between the two transmissions arises because the PCI bus has lower throughput than the Cluster Bus: compare Figure 9, where the PCI bus transfers are continuous in time.





When the last 20 PRIs (69 to 88) are processed, each 800-point result is transferred in a single transmission instead of two, but each transmission occupies the Cluster Bus for twice as long. As Figure 10 shows, because PRIs 67 and 68 are not processed, DSP3 and DSP4 each have an idle interval.



When the F3 parameter Trans\_Threshold is changed to 4, Figure 11 clearly shows that the positions of the transactions on the Cluster Bus change: each time a DSP completes four pulse compressions, it immediately transfers the $800 \times 4$ data to M2 over the PCI bus in two transmissions. The uneven intervals between the four Cluster Bus transactions are caused by access conflicts on the PCI bus.

Figure 12 also shows that only one PRI is transmitted at the end of each CPI period. This is because, when the results of the last 20 PRIs (69 to 88) are transmitted, each DSP has 5 (20 / 4) PRIs to transfer; with the transmission threshold set to 4, each DSP actually issues two transactions, the first carrying 4 PRIs and the second the remaining one. Modeling these details is not strictly necessary for the behavioral model. But if the ESV is to serve as the system specification (Specification), the overall operating-condition charts produced by executing the model are very valuable for communication within the project development team. The pipeline break visible in Figure 12 (near 5.7 ms) is caused by a PCI bus access conflict and is discussed further in § 3.4.1.
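The split of 5 PRIs into transactions of 4 and 1 is a plain quotient-and-remainder calculation. A minimal sketch, with a hypothetical function name:

```python
def split_into_transfers(n_pri: int, trans_threshold: int) -> list:
    """Split n_pri results into transfers of at most trans_threshold PRIs each."""
    full, rest = divmod(n_pri, trans_threshold)
    return [trans_threshold] * full + ([rest] if rest else [])


# The 20 trailing PRIs are shared by 4 DSPs, giving 5 PRIs per DSP.
per_dsp = 20 // 4

print(split_into_transfers(per_dsp, 4))  # [4, 1]
print(split_into_transfers(per_dsp, 1))  # [1, 1, 1, 1, 1]
```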



#### 1.1.9 **Operating-condition chart of module M2**

The timing relationships of M2 are much more complex than those of M1. This is because M1 is responsible only for system functions 1 and 2 and part of function 3, while M2 performs all the remaining functions.

Activity on M2 begins with transactions on the Cluster Bus: data from M1 arrives over the PCI bus, passes through the QL5064 PCI bridge, and is transferred onto the M2 Cluster Bus. When the Cluster Bus transaction completes, the data has been placed in the on-chip DRAM of the TS201 and is ready for the next task. Interpreting the M2 operating-condition chart is interesting work: because the behavior model implements F3 and F4 in several different ways, many different charts can be compared to examine the actual results of the different options. Figure 13 shows an M2 operating-condition chart for this case, annotated to help explain the complex trigger relationships between the tasks running on M2.



Figure 13 also illustrates the difference between the model itself and a static document. Although the figure is the result of one run of the executable VisualSim model, in this document it is a static picture, so annotations are needed to explain the relationships within it. When the actual model is used for communication or analysis, the causal timing relationships between tasks can instead be explained by disabling some tasks in the model and running it again, or by redefining the chart of a resource (e.g. DSP3) so that its Y axis shows the ID of the function being executed. Such changes are very quick to make in an executable VisualSim model.

#### 1.1.10 **Operating-condition charts for deployment 2**

Figures 14 and 15 show the operating-condition charts of M1 and M2 for deployment 2. As the figures show, each CPI is processed entirely within M1 or within M2, so no data transactions pass between the modules over the EP-Cluster Bus.





#### Analysis and discussion of the model execution results

#### PCI bus access conflicts

Under deployment 1, the PCI bus is used by two tasks: ① Function 1, in which the high-speed ADC sampling board sends the data in its buffer to the FPGA board over PCI; ② Function 3, in which the pulse compression results completed by module M1 are transposed and sent to M2 over PCI for subsequent processing. Contention for the PCI bus can therefore occur.

The frequency and timing of the PCI accesses of Function 1 are determined mainly by the sampling frequency of the ADC board and by the board's buffer threshold, that is, the number of PRIs of samples buffered before each PCI transfer. The frequency and timing of the PCI accesses of Function 3 are determined by how its tasks are implemented: when is the data transposed, when is it transmitted, and how much data is sent each time?

Because several factors jointly affect PCI bus access, deriving the PCI operating conditions by hand is very difficult for a designer, whereas the PCI operating-condition chart of the system model gives the answer directly. Figure 16 shows the PCI operating conditions with an ADC board transfer threshold of 22 (in units of PRIs, likewise below) and an FPGA buffer transfer threshold of 2; Figure 17 shows them with an ADC board transfer threshold of 2 and an FPGA buffer transfer threshold of 1.

Contention for the PCI bus does not in itself cause data loss, but it inevitably delays one of the competing transfers. If the PCI transfer of Function 1 is delayed far enough, the receive buffer on the FPGA board drains completely, the system has no data for subsequent processing, the data flow breaks, and the system is in error. In this model, the anomaly shows up in the operating-condition chart as a break in the PRI ticks.

Our initial design of the transpose function in F3 was relatively simple: after all PRIs in a CPI have been transposed, the results in DRAM are transferred out together. However, the simulation showed a considerable number of data-flow breaks near the CPI period boundary, as shown in the figure. The model execution results make it clear that this design is not feasible.



We therefore designed the alternative distributed transmission scheme: each time Function 2 has processed a few PRIs (rows), the data is sent to Function 4, and the transposition is actually carried out in Function 4. Verification with the system model confirmed that this scheme works properly.

#### 1.1.12 **Coupling between data processing and data transmission**

In Function 6, the modulus is taken of the results of 800 groups of 64-point windowed FFTs (the processing task). As soon as the modulus of each 64-point group is complete, the result is sent to the DSP designated for Function 7, using the corresponding LinkPort between the DSPs (the transfer task).

The model results show how the data coupling between the processing task and the transfer task affects the system. With typical parameters, the processing task is much slower than the transfer task. In Figure 19, the LINK transfer starts when the first group has been processed, and the next processing task starts as soon as the LINK transfer has been started, much like starting a LINK DMA and then processing the next group of data. The execution timing in Figure 19 is the most efficient design, because DMA lets data processing and data transmission run fully in parallel. In an actual implementation this can be achieved with ping-pong memory blocks.

Figure 20 shows another implementation, in which the processing and transfer tasks run serially: when the first group has been processed, the LINK transfer is started, and the next processing task starts only after that transfer is complete.

Figures 19 and 20 are "micro" views of task timing. What is the impact of the two implementations on the whole system? Compare Figures 21 and 22 below. Running the processing and transfer tasks serially pushes back the completion time of the whole system's processing; the comparison shows that the system-wide processing delays of the two approaches differ by more than 500 microseconds.
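The difference between the two timings can be estimated with a simple closed-form sketch. The function names and the per-group times below are hypothetical and chosen only for illustration; they are not the model's parameters and are not meant to reproduce the figure reported above. The overlapped formula assumes that processing is the slower task, as the text states for typical parameters.

```python
def serial_time(n_groups: int, t_proc: float, t_xfer: float) -> float:
    """Each group is processed and then transferred before the next one starts."""
    return n_groups * (t_proc + t_xfer)


def overlapped_time(n_groups: int, t_proc: float, t_xfer: float) -> float:
    """Transfer of group k overlaps processing of group k+1 (ping-pong buffers).

    Valid when t_proc >= t_xfer, so the link is always free when it is needed;
    only the last transfer is not hidden behind processing.
    """
    if t_proc < t_xfer:
        raise ValueError("sketch assumes processing is the slower task")
    return n_groups * t_proc + t_xfer


# Hypothetical per-group times in microseconds for 800 groups.
n, t_proc, t_xfer = 800, 2.0, 0.5

print(serial_time(n, t_proc, t_xfer))      # 2000.0
print(overlapped_time(n, t_proc, t_xfer))  # 1600.5
```

In the overlapped case nearly all of the transfer time disappears from the critical path, which is why the serial implementation shifts the completion time of every downstream task.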









#### System-wide processing delay

Comparing the execution results of the model for different software designs shows clearly how the design scheme affects the system-wide processing delay.

For example, under deployment 1, the buffer threshold of the ADC sampling board affects the system-wide processing delay as follows (all other system parameters unchanged):

When the ADC board buffer transfer threshold is 22, the system-wide processing delay is 11.335 ms; when the threshold is 1, it is 10.006 ms. The difference between the two is more than 1300 microseconds. The table below also gives the system-wide processing delay of deployment 2 under the same two conditions.

| System-wide processing delay (ms)       | Deployment 1 | Deployment 2 |
|-----------------------------------------|--------------|--------------|
| ADC board buffer transfer threshold: 22 | 11.335       | 11.556       |
| ADC board buffer transfer threshold: 1  | 10.006       | 10.220       |
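The saving obtained by lowering the threshold from 22 to 1 can be read off the table for both deployments. The short calculation below uses only the four values in the table; the variable names are illustrative.

```python
# System-wide processing delay in ms, keyed by (deployment, ADC buffer threshold).
delay_ms = {
    ("Deployment 1", 22): 11.335,
    ("Deployment 1", 1): 10.006,
    ("Deployment 2", 22): 11.556,
    ("Deployment 2", 1): 10.220,
}

# Saving in microseconds when the threshold is lowered from 22 to 1.
saving_us = {
    d: round((delay_ms[(d, 22)] - delay_ms[(d, 1)]) * 1000)
    for d in ("Deployment 1", "Deployment 2")
}

print(saving_us)  # {'Deployment 1': 1329, 'Deployment 2': 1336}
```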

These are only two examples of factors that affect the system-wide processing delay. Once the system model is complete, the modeling team can adjust all kinds of design elements and analyze their relationship to the processing delay in order to identify the performance bottlenecks or sensitive elements of the system.

#### 1.1.14 **Pipeline processing overlap**

The execution results of the system model can reveal details of the system design that are easily overlooked. For example, deployment 1 of this case uses CPI pipeline processing, so in each CPI frame period F4 performs the two tasks analyzed earlier: ① for the current CPI frame, it transposes the rows of a number of PRIs; ② for the previous CPI frame, whose 64 transposed PRIs have already been accumulated, it performs the windowed FFT directly. When, then, should task ② start? In theory it can start as soon as the CPI frame synchronization arrives. However, because the system uses CPI pipeline processing, the processing resource assigned to task ② (DSP3 of M2 in this scheme) may still be "draining" the pipeline. The design should of course prevent task ② from contending on DSP3 with pipeline tasks that are still executing.

In the initial model design, the FPGA board provided the start trigger for task ② at the 2nd PRI of each CPI. Figure 23 shows the resulting operating conditions after running the model (enlarged view near the CPI boundary).



The figure shows that, near 5.824 ms at the start of a cycle, contention with the processing task of one PRI occurs on DSP3. Based on this result, we adjusted the model to trigger task ② at the 3rd PRI. With this change, no pipeline overlap occurs (Figure 24).

